{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "election = pd.DataFrame({\n",
    "    'County': ['De Witt County', 'Lac qui Parle County', 'Lewis and Clark County',\n",
    "        'St John the Baptist Parish'],\n",
    "    'State': ['IL', 'MN', 'MT', 'LA'],\n",
    "    'Voted': ['97.8', '98.8', '95.2', '52.6']\n",
    "    \n",
    "})\n",
    "census = pd.DataFrame({\n",
    "    'County': ['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist'],\n",
    "        'State': ['IL', 'MN', 'MT', 'LA'],\n",
    "    'Population': ['16,798', '8,067', '55,716','43,044']\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "log_entry = r'169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]\"GET /stat141/Winter04 HTTP/1.1\" 301 328 \"http://anson.ucdavis.edu/courses\"\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)\"'"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5PgtKFhgwzv9"
   },
   "source": [
    "# String Manipulation\n",
    "\n",
    "There are a handful of basic string manipulation tools that we use a lot when we work with text:\n",
    "\n",
    "- Transform uppercase characters to lowercase (or vice versa).\n",
    "- Replace a substring with another or delete the substring.\n",
    "- Split a string into pieces at a particular character. \n",
    "- Slice a string at specified locations.\n",
    "\n",
    "We show how we can combine these basic operations to clean up the county names data.\n",
    "Remember that we have two tables that we want to join, but the county names are\n",
    "written inconsistently.\n",
    "\n",
    "Let's start by converting the county names to a standard format."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "LVaqzJs1wzwH"
   },
   "source": [
    "## Converting Text to a Standard Format with Python String Methods\n",
    "\n",
    "We need to address the following inconsistencies between the county names in\n",
    "the two tables:\n",
    "\n",
    "*  Capitalization: `qui` versus `Qui`.\n",
    "*   Omission of words: `County` and `Parish` are absent from the `census` table.\n",
    "*  Different abbreviation conventions: `&` versus `and`.\n",
    "*  Different punctuation conventions: `St.` versus `St`.\n",
    "*  Use of whitespace: `DeWitt` versus `De Witt`.\n",
    "\n",
    "When we clean text, it's often easiest to first convert all of the characters to\n",
    "lowercase. It's easier to work entirely with lowercase characters than to try to\n",
    "track combinations of uppercase and lowercase. Next, we want to fix\n",
    "inconsistent words by replacing `&` with `and` and removing `County` and\n",
    "`Parish`. Finally, we need to fix up punctuation and whitespace inconsistencies."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With just two Python string methods, `lower` and `replace`, we can take all of these actions and clean the county names. These are combined into a method called `clean_county`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_county(county):\n",
    "    return (county\n",
    "            .lower()\n",
    "            .replace('county', '')\n",
    "            .replace('parish', '')\n",
    "            .replace('&', 'and')\n",
    "            .replace('.', '')\n",
    "            .replace(' ', ''))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "LVaqzJs1wzwH"
   },
   "source": [
    "Although simple, these methods are the primitives that we can piece together to form more complex string operations. These methods are conveniently defined on all Python strings and do not require importing other modules. Although it is worth familiarizing yourself with [the complete list of string\n",
    "methods](https://docs.python.org/3/library/stdtypes.html#string-methods), we\n",
    "describe a few of the most commonly used methods in {numref}`Table %s <string-methods>`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "CcwHc5QOwzwI"
   },
   "source": [
    ":::{table} String methods\n",
    ":name: string-methods\n",
    "| Method              | Description                                                                 |\n",
    "| ------------------- | --------------------------------------------------------------------------- |\n",
    "| `str.lower()`       | Returns a copy of a string with all letters converted to lowercase          |\n",
    "| `str.replace(a, b)` | Replaces all instances of the substring `a` in `str` with  substring `b` |\n",
    "| `str.strip()`       | Removes leading and trailing whitespace from `str`                          |\n",
    "| `str.split(a)`      | Returns substrings of `str` split at a substring `a`                        |\n",
    "| `str[x:y]`          | Slices `str`, returning indices x (inclusive) to y (not inclusive)          |\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We next verify that the `clean_county` method produces matching county names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'],\n",
       " ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "([clean_county(county) for county in election['County']],\n",
    " [clean_county(county) for county in census['County']])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the county names now have consistent representations, we can successfully join the two tables."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String Methods in pandas\n",
    "\n",
    "In the preceding code, we used a loop to transform each county name. The `pandas` `Series`\n",
    "objects provide a convenient way to apply string methods to each item in the\n",
    "series. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `.str` property on `pandas` `Series` exposes the same Python string methods. Calling a method on the `.str` property calls the method on each\n",
    "item in the series. This allows us to transform each string in the series\n",
    "without using a loop. We save the transformed counties back into their\n",
    "originating tables. First we transform the county names in the election table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "election['County'] = (election['County']\n",
    " .str.lower()\n",
    " .str.replace('parish', '')\n",
    " .str.replace('county', '')\n",
    " .str.replace('&', 'and')\n",
    " .str.replace('.', '', regex=False)\n",
    " .str.replace(' ', ''))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": [
     "hide-cell"
    ]
   },
   "outputs": [],
   "source": [
    "census['County'] = (census['County']\n",
    " .str.lower()\n",
    " .str.replace('parish', '')\n",
    " .str.replace('county', '')\n",
    " .str.replace('&', 'and')\n",
    " .str.replace('.', '', regex=False)\n",
    " .str.replace(' ', ''))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also transform the names in the census table so that the two tables contain the same representations of the county names. We can  join these tables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     }
    },
    "colab_type": "code",
    "id": "0O4O2juwwzwP",
    "outputId": "aae014e8-a341-4f05-fe65-7ecf672a24dc"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>County</th>\n",
       "      <th>State</th>\n",
       "      <th>Voted</th>\n",
       "      <th>Population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>dewitt</td>\n",
       "      <td>IL</td>\n",
       "      <td>97.8</td>\n",
       "      <td>16,798</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lacquiparle</td>\n",
       "      <td>MN</td>\n",
       "      <td>98.8</td>\n",
       "      <td>8,067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lewisandclark</td>\n",
       "      <td>MT</td>\n",
       "      <td>95.2</td>\n",
       "      <td>55,716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>stjohnthebaptist</td>\n",
       "      <td>LA</td>\n",
       "      <td>52.6</td>\n",
       "      <td>43,044</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             County State Voted Population\n",
       "0            dewitt    IL  97.8     16,798\n",
       "1       lacquiparle    MN  98.8      8,067\n",
       "2     lewisandclark    MT  95.2     55,716\n",
       "3  stjohnthebaptist    LA  52.6     43,044"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "election.merge(census, on=['County','State'])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "Note that we merged on two columns: the county name and the state. We did\n",
    "this because some states have counties with the same name. For example,\n",
    "California and New York both have a county called King.  \n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3595dXFwwzwT"
   },
   "source": [
    "To see the complete list of string methods, we recommend looking at the [Python documentation on `str`\n",
    "methods][py_str] and the [`pandas` documentation for the `.str`\n",
    "accessor][pd_str]. We did the canonicalization task using only\n",
    "`str.lower()` and multiple calls to `str.replace()`. Next, we extract\n",
    "text with another string method, `str.split()`.\n",
    "\n",
    "[py_str]: https://docs.python.org/3/library/stdtypes.html#string-methods\n",
    "[pd_str]: https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting Strings to Extract Pieces of Text\n",
    "\n",
    "Let's say we want to extract the date from the web server's log entry:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": [
     "remove-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]\"GET /stat141/Winter04 HTTP/1.1\" 301 328 \"http://anson.ucdavis.edu/courses\"\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)\"'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "log_entry"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]\"GET /stat141/Winter04 HTTP/1.1\"\n",
      "301 328 \"http://anson.ucdavis.edu/courses\"\"Mozilla/4.0 (compatible; MSIE 6.0;\n",
      "Windows NT 5.0; .NET CLR 1.1.4322)\"\n"
     ]
    }
   ],
   "source": [
    "from textwrap import fill\n",
    "print(fill(log_entry, width=79))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "String splitting can help us home in on the pieces of information that form\n",
    "the date. For example, when we split the string on the left bracket, we get two\n",
    "strings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['169.237.46.168 - - ',\n",
       " '26/Jan/2004:10:47:58 -0800]\"GET /stat141/Winter04 HTTP/1.1\" 301 328 \"http://anson.ucdavis.edu/courses\"\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)\"']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "log_entry.split('[')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second string has the date information, and to get the day, month, and\n",
    "year, we can split that string on a colon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'26/Jan/2004'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "log_entry.split('[')[1].split(':')[0]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To separate out the day, month, and year, we can split on the forward slash. Altogether we split the original string three times, each time keeping only the pieces we are interested in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['26', 'Jan', '2004']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(log_entry.split('[')[1]\n",
    " .split(':')[0]\n",
    " .split('/'))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By repeatedly using `split()`, we can extract many of the parts of the log\n",
    "entry. But this approach is complicated---if we wanted to\n",
    "also get the hour, minute, second, and time zone of the activity,\n",
    "we would need to use `split()` six times in total.\n",
    "There's a simpler way to extract these parts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['26', 'Jan', '2004', '10', '47', '58', '-0800']"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "pattern = r'[ \\[/:\\]]' \n",
    "re.split(pattern, log_entry)[4:11]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This alternative approach uses a powerful tool called a regular expression,\n",
    "which we cover in the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "default_view": {},
   "name": "pythonstring.ipynb",
   "provenance": [],
   "version": "0.3.2",
   "views": {}
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "vscode": {
   "interpreter": {
    "hash": "88b78e51f3c14ce30a2d9140a12930f3844652caceef2256c35621be48397e78"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
